ctGAN: combined transformation of gene expression and survival data with generative adversarial network

Abstract Recent studies have extensively used deep learning algorithms to analyze gene expression to predict disease diagnosis, treatment effectiveness, and survival outcomes. Survival analysis studies on diseases with high mortality rates, such as cancer, are indispensable. However, deep learning models are plagued by overfitting owing to the limited sample size relative to the large number of genes. Consequently, the latest style-transfer deep generative models have been implemented to generate gene expression data. However, these models are limited in their applicability for clinical purposes because they generate only transcriptomic data. Therefore, this study proposes ctGAN, which enables the combined transformation of gene expression and survival data using a generative adversarial network (GAN). ctGAN improves survival analysis by augmenting data through style transformations between breast cancer and 11 other cancer types. We evaluated the concordance index (C-index) enhancements compared with previous models to demonstrate its superiority. Performance improvements were observed in nine of the 11 cancer types. Moreover, ctGAN outperformed previous models in seven out of the 11 cancer types, with colon adenocarcinoma (COAD) exhibiting the most significant improvement (median C-index increase of ~15.70%). Furthermore, integrating the generated COAD enhanced the log-rank p-value (0.041) compared with using only the real COAD (p-value = 0.797). Based on the data distribution, we demonstrated that the model generated highly plausible data. In clustering evaluation, ctGAN exhibited the highest performance in most cases (89.62%). These findings suggest that ctGAN can be meaningfully utilized to predict disease progression and select personalized treatments in the medical field.

Recently, deep learning algorithms have been extensively used to investigate gene expression for disease diagnosis, treatment, and survival analysis. Numerous studies have aimed to identify genes with the greatest relevance to specific diseases and use them to predict patient prognosis [20-23]. Additionally, deep learning algorithms can predict patient prognosis based on diverse treatment methods and anticipate disease progression, including metastasis [24,25]. Survival analysis research on diseases with high fatality rates, such as cancer, is being actively conducted [26,27].
Gene expression has been actively studied for several decades, resulting in the accumulation of large-scale datasets such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus [28-30]. However, the size of these datasets is insufficient for training deep learning models. Advanced deep learning studies typically use tens of thousands of samples; for instance, MNIST comprises 60 000 images and CIFAR-10 includes 50 000 images for training. By contrast, in the TCGA dataset, the maximum number of samples for a specific cancer type is ∼1000. Although substantial data are crucial, collecting an unlimited number of samples from individuals is not feasible, and measuring the expression of even a single gene incurs high costs. These challenges hinder the implementation of deep learning models. Owing to the numerous genes and relatively small sample size, overfitting problems frequently arise during training [31]. Nevertheless, gene expression analyses can provide new insights.
To address this problem, recent studies have focused on transforming gene expression data across species, for instance, by leveraging genomic data recorded from human and mouse models to develop statistical methods such as EBT and mEBT [32,33]. These models aim to identify genes with high response concordance to overcome the challenges posed by discrepancies between animal and human responses. In addition, several studies have developed deep generative models for generating plausible gene expression data using style transformation. Representative examples of deep generative models include variational autoencoders (VAE) [34,35] and generative adversarial networks (GANs) [36,37]. Various models have been developed for specific applications. For instance, Lotfollahi et al. [38] presented trVAE, and Russkikh et al. [39] proposed stVAE, both of which implement style-transfer methodologies in conditional VAE (CVAE) to generate gene expression. trVAE overcomes the limitations of CVAE using the maximum mean discrepancy in the decoder layer. It demonstrates high robustness and accuracy in predicting the conditions and classes for images and single-cell RNA-seq data. stVAE utilizes CVAE, Y-autoencoders, and adversarial feature decomposition for RNA-seq data harmonization, using technical factors or biological details as style components to facilitate style transfer. It exhibits high performance in terms of style and semantic prediction accuracy.
Style transformation techniques not only enable the prediction of experimental outcomes, which, in turn, reduces the time, cost, and risk associated with data acquisition, but also enhance model accuracy when sufficient data is available. However, both trVAE and stVAE exclusively concentrate on generating transcriptomic data, resulting in reduced clinical utility. To be applicable to clinical purposes, the generation should be extended to include clinical outcomes. Therefore, we developed a style-transfer method capable of generating both gene expression and clinical data to enhance survival analysis. Instead of using conditional generation, we leveraged a cyclic generation model for data augmentation through style transformation between breast cancer and 11 other cancer types. Training a cyclic-generation model is more challenging than training a conditional model. However, we addressed this issue by modifying the internal structure of the generator and discriminator to better fit our data and by employing a training technique that enhanced training stability.
We showed that the improved accuracy of survival analysis was not merely due to an increase in the quantity of data. Using visualization tools, we examined the distribution of the data, ensuring that the generated and reconstructed data closely approximated the real data. Additionally, in the clustering evaluation, ctGAN demonstrated the highest performance in most cases. Our approach enables the reflection of varying survival times for different cancer types, thereby enhancing the effectiveness of the survival analysis. Furthermore, ctGAN outperforms existing methods.

Model architecture
A deep generative style-transfer architecture called ctGAN is proposed to enhance survival analysis through gene expression and survival data augmentation. Gene expression and survival data were obtained from TCGA, which encompasses 33 different cancer types. In this study, we focused on 12 specific types of cancer, including breast invasive carcinoma (BRCA). Among these 12 cancer types, BRCA had the highest sample count, with 1051 samples, and the second-highest sample count was 530 samples; however, this was still insufficient for conducting survival analysis.
This discrepancy between the high number of genes and the limited sample size presents a significant challenge when applying deep learning methodologies. A deep learning model exists that automatically selects genes during training using various regularization techniques [40]. However, this approach does not resolve the fundamental issue of an insufficient sample size. Therefore, we utilized a style-transfer methodology to augment the sample sizes by transforming gene expression and survival data between different cancer types. Our experimental results demonstrate that this approach is beneficial for cancer survival analysis.
Because BRCA has the largest dataset among the 12 cancer types, we conducted transformations between BRCA and the other 11 cancer types. Consequently, as shown in Fig. 1A, new gene expression and survival time data were generated for all 11 cancer types, matching the sample count of BRCA. Moreover, BRCA itself received newly transformed gene expression and survival time data from the 11 other cancer types.
As shown in Fig. 1C, ctGAN builds on the cyclic-consistent GAN [41-44]. For the GAN loss, we applied a least squares loss instead of a negative log-likelihood objective to stabilize the model training procedure. For the cyclic and identity losses, we applied the L1 norm. Our full objective combines the GAN, cyclic, and identity losses.
Because the original domain-transfer GAN was initially designed for style transfer in image data, we modified the internal architecture of the generator and discriminator to make it applicable to gene expression. Skip connections and fully connected layers were incorporated into the generator of ctGAN to achieve an effect similar to that of the ResNet architecture [45]. As shown in Fig. 1B (ii), the residual block comprised two hidden layers. The first hidden layer included one fully connected layer [46], batch normalization [47], and ReLU activation [48], whereas the second hidden layer comprised one fully connected layer and batch normalization. As shown in Fig. 1B (i), the generator was composed of two hidden layers, each comprising a single fully connected layer, batch normalization, and ReLU activation, followed by three residual blocks, two additional hidden layers, and a fully connected layer.
As shown in Fig. 1B (iii), the discriminator comprised three hidden layers, each containing one fully connected layer and leaky ReLU activation [49], culminating in the final fully connected layer.

Model training and parameter setting
Training and fine-tuning cyclic-consistent GAN models using conventional cross-validation methods may be challenging due to the simultaneous training of the generator and discriminator. Because adjusting one model can affect the other, accomplishing an adequate balance is critical [50,51]. A discriminator loss converging to near zero suggests that the discriminator learns faster than the generator, hindering proper generator training. Additionally, the generator depends heavily on cyclic and identity losses during training, resulting in insufficient learning of the GAN loss. To address this issue, we adjusted the training method, enabling the generator to undergo three times more epochs than the discriminator, and repeated the training three times on the same dataset [52]. Regarding the loss functions, we used the mean square error (MSE) loss for the GAN loss and L1 loss for the cyclic and identity losses.
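The loss choices and the generator-heavy schedule described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the discriminator scores are plain lists of floats, and the 3:1 ratio is applied per batch as one possible reading of "three times more epochs". The threefold repetition of training on the same dataset is not modeled.

```python
from statistics import fmean


def lsgan_discriminator_loss(real_scores, fake_scores):
    """MSE (least-squares) GAN loss for a discriminator: real -> 1, generated -> 0."""
    real_term = fmean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = fmean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """MSE GAN loss for a generator: it wants generated samples scored as 1."""
    return fmean([(s - 1.0) ** 2 for s in fake_scores])


def l1_loss(x, y):
    """Mean absolute error, used here for both the cyclic and identity terms."""
    return fmean([abs(a - b) for a, b in zip(x, y)])


def update_schedule(n_batches, generator_steps=3):
    """Assumed schedule: `generator_steps` generator updates per discriminator update."""
    schedule = []
    for _ in range(n_batches):
        schedule.extend(["G"] * generator_steps)
        schedule.append("D")
    return schedule
```

A discriminator loss near zero under this formulation means real samples are scored close to 1 and generated samples close to 0, which is the imbalance the extra generator updates are meant to counter.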
Regarding parameter settings, we assessed the model's performance in terms of concordance index (C-index) [53]

Survival analysis method
To evaluate survival analysis, we used supervised principal component analysis (SuperPC), which is a widely used method for survival analysis based on gene expression. SuperPC was introduced by Bair et al. [54] and is particularly beneficial when dealing with high-dimensional data, wherein the number of features significantly outweighs the number of samples. This method closely resembles traditional principal component analysis except that it utilizes a subset of predictors chosen based on their correlation with the outcome. This method is applicable to both regression and generalized regression problems, including survival analysis.
Using SuperPC, we assessed the improvement in the C-index using ctGAN. SuperPC was implemented in Python.

Dataset
The multi-omics data utilized in this research involved RNA-sequencing data representing mRNA expression. Gene expression and survival data were collected from TCGA, which included 33 different cancer types. Among these, only samples labeled as tumors were extracted and matched with survival data using sample IDs. After excluding cases with a survival time of 0 days, we selected only those cancers with 300 or more samples and at least 50 instances wherein the event occurred. Within the context of survival analysis, an event is an occurrence of the outcome of interest, whereas censoring indicates that the outcome had not occurred by the end of the study. Twelve cancers were included in this study (Table 1).
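The cohort selection rule above (drop 0-day cases, then require at least 300 samples and at least 50 events) can be expressed as a short filter. The sketch below assumes a hypothetical in-memory layout, a dictionary mapping each cancer code to a list of `(survival_days, event_flag)` pairs with `event_flag` equal to 1 for an observed event and 0 for censoring; the cohort names and counts are synthetic.

```python
def eligible_cancers(cohorts, min_samples=300, min_events=50):
    """Return the cohorts that pass the sample-count and event-count thresholds."""
    selected = {}
    for code, records in cohorts.items():
        # Exclude cases with a survival time of 0 days before counting.
        kept = [(days, event) for days, event in records if days > 0]
        n_events = sum(1 for _, event in kept if event == 1)
        if len(kept) >= min_samples and n_events >= min_events:
            selected[code] = kept
    return selected


# Synthetic cohorts for illustration only (not TCGA counts).
toy_cohorts = {
    "TOY_A": [(10 + i, 1 if i < 60 else 0) for i in range(320)],  # 320 samples, 60 events
    "TOY_B": [(i, 1 if i % 10 == 0 else 0) for i in range(400)],  # too few events
    "TOY_C": [(5 + i, 1) for i in range(200)],                    # too few samples
}
toy_selected = eligible_cancers(toy_cohorts)
```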

Gene selection
The raw gene expression data comprises 60 660 Ensembl IDs. However, not all genes are significant for survival analysis. Therefore, if the goal of data augmentation is to improve survival analysis, genes relevant to survival analysis should be chosen. For instance, when augmenting colon adenocarcinoma (COAD) data using ctGAN, it is essential to select genes related to COAD survival analysis. Therefore, we calculated the C-index for each gene using the Cox proportional-hazards (CoxPH) model.
During this process, we obtained 60 660 C-index values, with higher values indicating genes that were more closely related to COAD survival. Consequently, we selected the top 300 genes and used the same gene set for the BRCA data.
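The per-gene ranking can be illustrated without fitting a Cox model. A single-covariate CoxPH risk score is monotone in the gene's expression, so its C-index equals either the concordance of the raw expression values or one minus that value, depending on the sign of the coefficient. The sketch below takes the larger of the two as a stand-in, which is an assumption rather than the procedure used in this study; the function names and toy data are hypothetical.

```python
def concordance_index(times, events, scores):
    """Harrell's C: share of comparable pairs in which the higher score fails earlier."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier subject of a pair must have an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.5


def top_genes_by_concordance(expression, times, events, top_k=300):
    """Rank genes by direction-agnostic univariate concordance and keep the top_k."""
    scored = []
    for gene, values in expression.items():
        c = concordance_index(times, events, values)
        scored.append((max(c, 1.0 - c), gene))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [gene for _, gene in scored[:top_k]]


# Toy example: five subjects, three hypothetical genes.
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 1, 0, 1, 1]
toy_expression = {
    "gene_risk": [5, 4, 3, 2, 1],        # high expression, short survival
    "gene_noise": [3, 1, 3, 1, 3],       # weak relation to survival
    "gene_protective": [1, 2, 3, 4, 5],  # high expression, long survival
}
toy_top = top_genes_by_concordance(toy_expression, toy_times, toy_events, top_k=2)
```

In the workflow described here, the genes selected for the target cancer would then be used to subset the BRCA expression matrix as well.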
The selected gene sets for data augmentation through style transfer varied depending on the targeted cancer type. The experimental results, including those obtained when varying the number of genes used or randomly removing genes, are depicted in Figs S2 and S3. Additionally, Fig. S4 shows the outcomes of the experiment integrating significant genes from both the targeted and source cancer (BRCA), which reveal that employing only significant genes from the targeted cancer yields better performance.

Overview of style transfer workflow
Figure 2 presents an overview of the style transfer workflow from cancer type A to cancer type B. Data preprocessing was performed on gene expression and survival data from real datasets A and B, which exclusively contained tumor samples. The preprocessing involved the removal of samples with 0 survival days and log-normalization with $X_{\mathrm{new}} := \log_2(X + 1)$, where $X$ is either the normalized expression value in fragments per kilobase of transcript per million mapped reads or the survival time. Next, the top 300 significant genes relevant to the survival analysis of B were selected based on the gene expression of both A and B. Consequently, 300-dimensional gene expression and survival time data for both A and B were fed as input into ctGAN, which, in turn, generated new gene expression and survival time data. Subsequently, by concatenating the real A event data with the generated B data, complete survival data were created. The generated B data were then used in the survival analysis alongside the real B data. Here, if we presume A as BRCA and B as COAD, ctGAN will generate gene expression and survival time data for 366 BRCA and 1051 COAD samples, because the original sample size of BRCA was 1051 and COAD was 366.
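The preprocessing step can be sketched as follows. The record layout (a dictionary with `expr`, `days`, and `event` keys) is hypothetical, and the sketch assumes the log transform is applied to both the expression values and the survival time, following the description above.

```python
import math


def preprocess(samples):
    """Drop samples with 0 survival days and apply log2(x + 1) to expression and time."""
    cleaned = []
    for sample in samples:
        if sample["days"] == 0:
            continue
        cleaned.append({
            "expr": [math.log2(x + 1.0) for x in sample["expr"]],
            "days": math.log2(sample["days"] + 1.0),
            "event": sample["event"],  # event flags are left untouched
        })
    return cleaned


toy_samples = [
    {"expr": [0.0, 1.0, 3.0], "days": 7, "event": 1},
    {"expr": [2.0, 2.0, 2.0], "days": 0, "event": 0},  # removed: 0 survival days
]
toy_cleaned = preprocess(toy_samples)
```

Adding 1 before taking the logarithm keeps zero expression values finite.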
Further details regarding the architecture and training of trVAE and stVAE, along with a comparison to ctGAN, can be found in the Supplementary Methods section and Figs S5 and S6.

Evaluation of survival analysis enhancement
ctGAN was proposed to enhance survival analysis by transforming gene expression and survival data. The performance of ctGAN was evaluated and compared to that of existing methods: trVAE and stVAE. The C-index and log-rank p-value were used as evaluation metrics for survival analysis. The C-index is among the most frequently used metrics for survival analysis. It does not evaluate the exact survival time of a subject but rather assesses whether the predicted ordering of survival times across subjects agrees with the observed ordering [53]. The log-rank p-value indicates the significance of the difference between the two groups when plotted on the Kaplan-Meier (KM) curve as high- and low-risk groups [55]. A p-value less than 0.05 is considered statistically significant.
To illustrate the enhanced accuracy of survival analysis achieved with the generated data, we initially divided the real dataset into training and test sets. Subsequently, the data generated by ctGAN was integrated into the training set. We then trained the SuperPC model and evaluated its performance using the C-index on the test set data. Given the variability of the C-index depending on the dataset, we conducted 100 rounds of cross-validation by partitioning the real data with different random state numbers. As shown in Fig. 3A, based on the median value from these 100 rounds of cross-validation, ctGAN exhibited improved C-index values in nine out of 11 cancer types, excluding KIRC and LUAD. For those nine types of cancer, except for HNSC and LGG, the analysis including generated samples exhibited statistically significant differences (p-value <0.05) when compared to using only real data. Notably, the model exhibited the most significant performance enhancement for COAD (median C-index value increased from 0.656 for only real samples to 0.759 when including generated samples), representing a performance improvement of ∼15.70%. In addition, ctGAN outperformed the other methods in seven cancer types, excluding HNSC, KIRC, LGG, and LUAD. Among these, the analysis of five cancer types (BLCA, COAD, LIHC, OV, SKCM) using ctGAN displayed statistically significant differences (p-value <0.05) compared to both trVAE and stVAE. This result confirms the robustness of ctGAN. Conversely, compared to other models, trVAE yielded the highest performance improvement for only two cancers (LGG, LUAD), with only LUAD demonstrating statistical significance (p-value <0.05) compared to both ctGAN and stVAE. Similarly, stVAE exhibited the highest performance improvement for only two cancers (HNSC, KIRC), but neither case reached statistical significance (p-value <0.05) when compared to both ctGAN and trVAE. These results confirm that ctGAN achieved greater performance enhancement and statistical significance than trVAE and stVAE. Fig. S7 depicts an additional experiment exploring the C-index improvement when only generated data are used.
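The evaluation protocol above, in which generated samples are added only to the training portion of each random split, can be sketched as a small harness. Everything here is illustrative: `fit` and `score` are caller-supplied placeholders (for example a SuperPC fit and a C-index computation), and the 30% test fraction is an assumed default, since the split ratio for the 100 rounds is not stated here.

```python
import random
from statistics import median


def repeated_split_scores(real, generated, fit, score, n_rounds=100, test_fraction=0.3):
    """Score a model over repeated random splits of the real data.

    Generated samples are appended to the training set only, so every test
    set consists exclusively of real samples.
    """
    scores = []
    for seed in range(n_rounds):
        rng = random.Random(seed)  # one random state per round
        shuffled = list(real)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train + list(generated))
        scores.append(score(model, test))
    return scores


def median_score(scores):
    """Summary statistic compared between the real-only and augmented settings."""
    return median(scores)
```

Running the harness once with an empty `generated` list and once with the ctGAN output gives the two score distributions whose medians are compared.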
ctGAN generated gene expression and survival time data through style transfer between BRCA and 11 other types of cancer. As an alternative for comparison, only gene expression was generated using ctGAN, while survival time data were taken directly from BRCA. As shown in Fig. 3B, the performance was notably higher in seven cancers (excluding HNSC, LGG, LUAD, and STAD) when the survival time data were also generated through ctGAN. Among these, the analysis of six cancer types (COAD, KIRC, LIHC, LUSC, OV, SKCM) revealed statistically significant differences (p-value <0.05). The proposed method considered diverse survival times across different types of cancer, thereby improving the efficiency of survival analysis.
As shown in Fig. 4A, considering the median value from 100 rounds of cross-validation, ctGAN demonstrated enhanced log-rank p-values in nine of the 11 cancer types, excluding LUAD and STAD.
In Fig. 4B and C, the KM plots are presented for COAD (the cancer type for which the model showed the most substantial performance improvement, Fig. 3A). Owing to the impracticality of presenting KM plots for all 100 cross-validation results, a split of ∼7:3 was conducted in chronological order to create the training and testing sets. The SuperPC model was trained on the training set. If a test sample's partial hazard exceeded the median partial hazard in the training set, the sample was classified as high-risk; otherwise, it was classified as low-risk.
Fig. 4B shows the KM plot when using only real COAD data, whereas Fig. 4C shows the KM plot when real COAD data were combined with the COAD generated by ctGAN. The generated COAD data were merged into the training set for SuperPC model training, and a KM plot was generated using the test set. The results revealed a considerably more significant log-rank p-value of 0.041 when integrating the generated COAD data compared to a p-value of 0.797 when using only the real COAD data. This outcome underscores the effectiveness of style transformation in gene expression and survival data using ctGAN when other clinical conditions remain unchanged.
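The median-split risk grouping and the two-group log-rank test behind the KM comparison can be sketched in standard-library Python. This is a textbook formulation with hypothetical function names, not the code used for Fig. 4; the p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math
from statistics import median


def risk_groups(train_hazards, test_hazards):
    """Label test samples high-risk when they exceed the training-set median hazard."""
    cutoff = median(train_hazards)
    return ["high" if h > cutoff else "low" for h in test_hazards]


def logrank_p_value(times, events, groups):
    """Two-group log-rank test; returns the chi-square (1 df) p-value."""
    observed_high = 0.0
    expected_high = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        failed = [g for ti, e, g in zip(times, events, groups) if ti == t and e]
        n, d = len(at_risk), len(failed)
        share_high = at_risk.count("high") / n
        observed_high += failed.count("high")
        expected_high += d * share_high
        if n > 1:  # hypergeometric variance of the high-risk event count
            variance += d * share_high * (1.0 - share_high) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    chi2 = (observed_high - expected_high) ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A p-value below 0.05 from `logrank_p_value` corresponds to the significance criterion used for the KM plots.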
We conducted additional experiments for COAD. When Random Survival Forests were employed instead of SuperPC, adding the generated data still led to a statistically significantly better C-index (p-value <0.002) than using only real data. Additionally, training ctGAN without identity loss improved performance compared to using only real data. However, the median C-index exhibited a statistically significant (p-value <0.001) decrease of 8.43% (from 0.759 with identity loss to 0.695 without identity loss).
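The percentages quoted for COAD follow directly from the reported median C-index values, as the short check below shows. The helper name is hypothetical.

```python
def relative_change_percent(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Real-only COAD (0.656) versus real plus generated COAD (0.759).
gain_with_generated = relative_change_percent(0.656, 0.759)

# With identity loss (0.759) versus without identity loss (0.695).
change_without_identity = relative_change_percent(0.759, 0.695)
```

Rounded to two decimals, these give a gain of 15.70% and a drop of 8.43%, matching the figures in the text.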

Style transfer validation
The style transfer performance of ctGAN was evaluated using visualization tools to examine data distributions. Our emphasis was on COAD, for which the model exhibited the highest performance improvement among the 11 cancer types (Fig. 3A).
Using ctGAN, COAD was transformed into BRCA, and vice versa. A successful style transfer between the two domains implied that the generated data closely resembled the real data from the same domain. In other words, the generated COAD data should be close to the real COAD data, and the generated BRCA data should be adjacent to the real BRCA data. Additionally, the distribution of the reconstructed data must be inspected. Reconstructed data refers to instances wherein the data were transformed into another domain and then returned to their original domain. For instance, reconstructed COAD signifies data that were initially real COAD, transformed into BRCA, and then returned to COAD. Similarly, the reconstructed BRCA represents the data that were initially real BRCA, transformed into COAD, and then reverted to its original BRCA state. The closer the reconstructed data matched the real data, the more effective the style transfer by ctGAN.
Using ctGAN, we can predict the gene expression and survival time of patients with breast cancer that has transformed into colon cancer. Naturally, if a patient's data are transformed back into breast cancer data, retaining the original data is expected to yield accurate results.
As shown in Fig. 5A and C, the generated data closely approximated the real data, confirming effective clustering between BRCA and COAD. Additionally, Fig. 5B and D illustrate that the reconstructed data closely resembled the real data, indicating successful clustering between BRCA and COAD.
As shown in Fig. 6, the data distributions of the real and reconstructed COAD for survival time and the top seven significant genes were remarkably similar. Figures 5 and 6 indicate that the improvement in survival analysis performance by ctGAN was not a result of random data augmentation. Instead, the model generated highly plausible data, which exerted a positive influence on survival analysis.
Table 2 presents a comparative analysis of the clustering evaluation metrics, MSE and R-squared (R2) values, for the proposed model and other style transfer-based frameworks across all 11 types of cancers.
The k-nearest neighbor (KNN) purity [56], normalized mutual information (NMI) [57], and adjusted Rand index (ARI) [58] are clustering evaluation metrics applicable when the class labels are known. If the cancer type is the same, it should have the same class label. These metrics assess the alignment of the clustering results with the actual labels. KNN purity and NMI range from 0 to 1, whereas ARI is at most 1 and can fall below 0 when agreement is worse than chance; in all three, a value closer to 1 indicates better clustering performance.
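As one concrete instance of a label-based metric, the adjusted Rand index can be computed from pair counts alone. The sketch below is a standard-library version of the usual contingency-table formula and is offered only as an illustration; the study's own metric implementation is not described here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pairs that fall together in both labelings, in the true labels, in the predictions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2.0
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)
```

The score depends only on which samples are grouped together, so a clustering that recovers the cancer types under different cluster names still scores 1.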
Conversely, the Silhouette and Dunn indices are clustering evaluation metrics suitable when class labels are unknown [59,60]. The Silhouette index, ranging from −1 to 1, increases with higher intra-cluster similarity and lower inter-cluster similarity. A value closer to 1 indicates a higher clustering performance. However, the Dunn index increases with a smaller maximum intra-cluster distance and a larger minimum inter-cluster distance.
This study proposed ctGAN for the combined transformation of gene expression and survival data with GAN. Challenges in implementing deep learning models, including overfitting, arise because of the limited sample size relative to the large number of genes. Although previous methods, trVAE and stVAE, attempted to address this issue, their applicability for clinical purposes is limited, as they generate only gene expression. By contrast, ctGAN enhanced cancer survival analysis by augmenting data via a style transformation between BRCA and 11 other cancers. We assessed the improvement in the C-index by comparing it with that of previous models, demonstrating the superior performance of ctGAN. ctGAN enhanced the C-index values in nine of the 11 cancer types, and outperformed previous models in seven of the 11 cancer types. Notably, the model exhibited a substantial improvement in performance for COAD, with the median C-index value increasing by ∼15.70%. Furthermore, the integration of the generated COAD data resulted in a significantly lower log-rank p-value (0.041) compared with using only the real COAD data (p-value = 0.797). Using visualization tools, we observed that the generated and reconstructed data distributions closely approximated the real data. Additionally, in the comparative analysis of the clustering evaluation metrics, MSE and R2 values, ctGAN consistently achieved either the highest or second-best results across various metrics (89.62%).
Our method aims to address the challenge of insufficient biomedical research data by providing significant assistance and innovation. Subsequent predictive models are likely to demonstrate increased reliability, accuracy, and robustness by generating gene expression and survival data for effective deep learning model training. Furthermore, we anticipate that ctGAN can significantly contribute to the medical field by assisting in the prediction of disease progression and the selection of personalized treatments, thereby generating substantial clinical impact. For instance, by capturing the differences between various cancers, it could help predict the likelihood of a breast cancer patient developing other types of cancer and their expected responses to different treatments.

Key Points
• This study proposes a style-transfer deep generative model, ctGAN, to address existing challenges in the implementation of deep learning models for analyzing gene expression.
• Previous models have limited applicability for clinical purposes, but ctGAN enables the combined transformation of gene expression and survival data and improves survival analysis by augmenting data through style transformations.
• ctGAN demonstrates high plausibility in data generation based on distribution and clustering evaluations. The proposed method may enable predictions regarding the likelihood of a patient with breast cancer developing other types of cancer and responding differently to various treatment methods.
The cycle-consistent GAN involves the training of two generators, $G_{AB}: A \to B$ and $G_{BA}: B \to A$, along with two discriminators, $D_A$ and $D_B$. Fig. 1C (i) illustrates the transformation of data from domain A to domain B, whereas Fig. 1C (ii) shows the transformation from domain B to domain A. A fundamental concept in cyclic domain translation is cycle consistency: each sample $a \in A$ that undergoes a transformation from domain A to domain B and then returns to domain A should remain identical to the original sample, i.e. $a \to G_{AB}(a) \to G_{BA}(G_{AB}(a)) \approx a$. Similarly, for each sample $b \in B$, $b \to G_{BA}(b) \to G_{AB}(G_{BA}(b)) \approx b$. Additionally, the concept of identity loss indicates that for each sample $b \in B$ that is fed into $G_{AB}$ as input, the output should be identical to the original sample, i.e. $b \to G_{AB}(b) \approx b$. Therefore, cycle-consistent adversarial networks have three types of losses: GAN, cyclic, and identity losses.
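Written out in the standard least-squares cycle-consistent form, the three losses can be sketched as below. This is a reconstruction under stated assumptions rather than the exact objective of this study: the weights $\lambda_{\mathrm{cyc}}$ and $\lambda_{\mathrm{id}}$ are assumed symbols whose values are not given here, and $a$ and $b$ denote samples (gene expression together with survival time) drawn from domains A and B.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{GAN}}(G_{AB}, D_B) &= \mathbb{E}_{b}\big[(D_B(b) - 1)^2\big] + \mathbb{E}_{a}\big[D_B(G_{AB}(a))^2\big], \\
\mathcal{L}_{\mathrm{cyc}} &= \mathbb{E}_{a}\big[\lVert G_{BA}(G_{AB}(a)) - a \rVert_1\big] + \mathbb{E}_{b}\big[\lVert G_{AB}(G_{BA}(b)) - b \rVert_1\big], \\
\mathcal{L}_{\mathrm{id}} &= \mathbb{E}_{b}\big[\lVert G_{AB}(b) - b \rVert_1\big] + \mathbb{E}_{a}\big[\lVert G_{BA}(a) - a \rVert_1\big], \\
\mathcal{L} &= \mathcal{L}_{\mathrm{GAN}}(G_{AB}, D_B) + \mathcal{L}_{\mathrm{GAN}}(G_{BA}, D_A) + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}} + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}.
\end{aligned}
```

In this formulation each discriminator minimizes its $\mathcal{L}_{\mathrm{GAN}}$ term, while the corresponding generator minimizes $\mathbb{E}_{a}\big[(D_B(G_{AB}(a)) - 1)^2\big]$ (and symmetrically for $G_{BA}$) together with the weighted cyclic and identity terms.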

Figure 1. ctGAN architecture and underlying mechanisms. (A) ctGAN transforms gene expression and survival time data across different types of cancer. (B) (i) Generator structure of ctGAN. (ii) Structure of the residual block within the generator of ctGAN. (iii) Discriminator structure of ctGAN. (C) Cyclic-consistent adversarial network scheme. (i) Transformation from cancer type A to cancer type B. (ii) Transformation from cancer type B to cancer type A.

Figure 2. Flowchart of the style transfer process, which comprises data preprocessing, gene selection, model training, and data generation.

Figure 3. C-index improvement across 11 types of cancer. (A) Comparison of C-index enhancements among ctGAN, trVAE, and stVAE. (B) Comparison of enhancement in C-index between generating gene expression and survival time using ctGAN versus generating only gene expression while retaining the original survival time from BRCA. Performance in both (A) and (B) was estimated over 100 cross-validation iterations. The dashed line in both (A) and (B) represents the scenario where only real data were used (*p-value <0.05 when ctGAN achieves the best performance).

Figure 4. KM plot and log-rank test p-value. (A) Augmenting real samples with those generated by ctGAN shows enhanced log-rank test p-values across 11 cancer types, as indicated by -log10(log-rank p-value). Performance was estimated over 100 cross-validation iterations. The dashed line represents the scenario where only real data were used. (B) KM plot using only real COAD data. (C) KM plot using both real COAD data and COAD data generated by ctGAN.

Figure 5. Visualization of real, generated, and reconstructed samples using t-SNE for COAD and BRCA gene expression. (A) Real and ctGAN-generated gene expressions for COAD visualized with a t-SNE perplexity set to 50. (B) Real and ctGAN-reconstructed gene expressions for COAD visualized with a t-SNE perplexity set to 50. (C) Real and ctGAN-generated gene expressions for BRCA and COAD visualized with a t-SNE perplexity set to 50. (D) Real and ctGAN-reconstructed gene expressions for BRCA and COAD visualized with a t-SNE perplexity set to 50.

Figure 6. Violin plot representing the distribution of real data and reconstructed data by ctGAN for the top seven significant genes in COAD survival analysis, including survival times.

Figure 7. Improvement in C-index across 11 types of cancer when using ctGAN on the SCAN-B dataset.

Table 1. Cancer abbreviations, full names, total sample counts, and event occurrence counts for the twelve types of cancer considered in the study.